This document summarises the data from https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001081. This study was produced by Michael Hall from Zam’s group and describes the development of the DRPRG tool that allows for drug resistance prediction using genome graphs.
In doing so, Michael collated and carefully(!) curated an enriched WHO dataset, which is summarised here.
The summary excludes 527 instances of duplicated BioSample IDs with different run IDs.
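As a rough illustration of what "duplicated BioSample IDs with different run IDs" means in practice, the sketch below flags such BioSamples in a toy table. The column names `run` and `biosample` match those used in the plotting code; the values are invented, and the "keep the first run" policy is an assumption, since the source does not say how the duplicates were handled.

```r
# Toy stand-in for the samplesheet; values are invented for illustration.
samples <- data.frame(
  run       = c("RUN1", "RUN2", "RUN3", "RUN4", "RUN5"),
  biosample = c("BS_A", "BS_A", "BS_B", "BS_C", "BS_C"),
  stringsAsFactors = FALSE
)

# Number of distinct run IDs recorded for each BioSample
runs_per_biosample <- tapply(samples$run, samples$biosample,
                             function(x) length(unique(x)))

# BioSamples that appear under more than one run ID
multi_run <- names(runs_per_biosample)[runs_per_biosample > 1]

# Assumed policy (not confirmed by the source): keep the first run listed
# for each BioSample and drop the others.
deduplicated <- samples[!duplicated(samples$biosample), ]

print(multi_run)
print(nrow(samples) - nrow(deduplicated))
```

Whether the 527 figure counts BioSamples or surplus runs is not stated, so both quantities are printed here.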
Each sample has a phenotype for at least one of 22 drugs: amikacin, bedaquiline, capreomycin, ciprofloxacin, clofazimine, cycloserine, delamanid, ethambutol, ethionamide, gatifloxacin, isoniazid, kanamycin, levofloxacin, linezolid, moxifloxacin, ofloxacin, para-aminosalicylic_acid, pyrazinamide, rifabutin, rifampicin, streptomycin, thioacetazone.
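The "at least one phenotype per sample" property can be checked directly. This is a minimal sketch on an invented table, assuming untested drugs are stored as `NA` (the real encoding of missing phenotypes is not stated in the source); `pheno_check` and its values are hypothetical.

```r
# Toy phenotype table: "R" = resistant, "S" = susceptible, NA = not tested.
# Storing untested drugs as NA is an assumption.
pheno_check <- data.frame(
  run        = c("RUN1", "RUN2", "RUN3"),
  biosample  = c("BS_A", "BS_B", "BS_C"),
  bioproject = c("PRJ_X", "PRJ_X", "PRJ_Y"),
  isoniazid  = c("R", "S", NA),
  rifampicin = c("R", NA, NA),
  amikacin   = c(NA, "S", NA),
  stringsAsFactors = FALSE
)

# Everything that is not an identifier column is treated as a drug column
drug_cols <- setdiff(names(pheno_check), c("run", "biosample", "bioproject"))

# Number of drugs with a non-missing phenotype for each sample
n_phenotyped <- rowSums(!is.na(pheno_check[, drug_cols, drop = FALSE]))

# Samples that would violate the "at least one phenotype" property
unphenotyped <- pheno_check$biosample[n_phenotyped == 0]
print(unphenotyped)
```

In the toy data the third sample is flagged on purpose; on the real samplesheet the expectation is that nothing is flagged.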
A breakdown of R/S phenotypes for the samples can be seen in the plot below:
```r
library(dplyr)
library(tidyr)
library(ggplot2)

pheno_data %>%
  # Keep only the per-drug phenotype columns
  select(-run, -bioproject, -biosample) %>%
  # Count R and S calls per drug; na.rm = TRUE guards against untested (NA) entries
  summarise(across(everything(),
                   ~ list(R = sum(. == "R", na.rm = TRUE),
                          S = sum(. == "S", na.rm = TRUE)))) %>%
  mutate(phenotype = c("R", "S")) %>%
  pivot_longer(-phenotype, names_to = "antibiotic", values_to = "value") %>%
  # The counts arrive as a list column, so flatten them to numeric
  mutate(value = as.numeric(as.character(value))) %>%
  # Shorten the longest drug name so the axis labels fit
  mutate(antibiotic = case_when(antibiotic == "para-aminosalicylic_acid" ~ "PAS",
                                TRUE ~ antibiotic)) %>%
  ggplot(aes(x = antibiotic, y = value)) +
  geom_col(aes(fill = phenotype), position = "dodge") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0, size = 8)) +
  geom_text(
    aes(label = value, group = phenotype),
    vjust = -1,
    position = position_dodge(width = 0.9),
    size = 2.5) +
  ylab("# samples") +
  ggtitle("WHO-enriched dataset, DRPRG")
```
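For a quick cross-check of the plotted numbers, the same R/S counts can be produced as a plain table. This is a base R sketch on invented data (`pheno_small` is hypothetical), again assuming untested drugs are `NA`.

```r
# Toy phenotype columns only; values are invented for illustration.
pheno_small <- data.frame(
  isoniazid  = c("R", "S", "R", NA),
  rifampicin = c("R", "S", "S", "S"),
  stringsAsFactors = FALSE
)

# Count R and S per drug. Fixing the factor levels keeps zero counts
# and leaves NA entries out of the tally.
counts <- sapply(pheno_small,
                 function(x) table(factor(x, levels = c("R", "S"))))
print(counts)
```

The result is a matrix with one row per phenotype and one column per drug, which can be compared against the bar labels.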
According to the README.md in the data source, the phenotype data can be found in config/illumina.samplesheet.csv. It was compiled by summarising the experiments found in config/samplesheets/ using workflow/notebook/notepad.ipynb. This notebook is large and includes various analyses; the section of interest here is “Data Collation”.
The first steps involved creating a WHO base dataset from two different data sources:

1. “gentb”, which is labelled as WHO correspondence (either a subset or superset of 2.).
2. “WHO”, which I’m assuming is from the latest catalogue.

This involved extensive data cleaning, resolving discrepancies between the two datasets, and adding extra metadata using code from https://github.com/mbhall88/WHO-correspondence/blob/main/docs/fill_in_who_samplesheet.py.
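To make "resolving discrepancies between the two datasets" concrete, here is a minimal sketch of one way to merge two phenotype sources and flag conflicting calls. It does not reproduce the logic of fill_in_who_samplesheet.py; the tables, values, and the resolution rule (set conflicting calls to `NA` for manual review) are all assumptions for illustration.

```r
# Invented stand-ins for the two sources; the real columns and values differ.
gentb <- data.frame(
  biosample = c("BS_A", "BS_B", "BS_C"),
  isoniazid = c("R", "S", "R"),
  stringsAsFactors = FALSE
)
who <- data.frame(
  biosample = c("BS_B", "BS_C", "BS_D"),
  isoniazid = c("S", "S", NA),
  stringsAsFactors = FALSE
)

# Full outer join so that samples present in only one source are kept
merged <- merge(gentb, who, by = "biosample", all = TRUE,
                suffixes = c("_gentb", "_who"))

# A discrepancy is a sample where both sources give a call and the calls differ
both_called <- !is.na(merged$isoniazid_gentb) & !is.na(merged$isoniazid_who)
merged$discrepant <- both_called &
  merged$isoniazid_gentb != merged$isoniazid_who

# Hypothetical resolution rule: take the call from whichever source has one,
# then blank out discrepant calls so they can be reviewed by hand.
merged$isoniazid <- ifelse(is.na(merged$isoniazid_gentb),
                           merged$isoniazid_who, merged$isoniazid_gentb)
merged$isoniazid[merged$discrepant] <- NA

print(merged)
```

Only BS_C is flagged in the toy data, because it is the only sample with two differing calls.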
Next, data from various publications were sequentially added and cleaned:
| dataset | publication | pheno_method |
|---|---|---|
| gentb | WHO correspondence | various |
| WHO | https://www.thelancet.com/journals/lanmic/article/PIIS2666-5247(21)00301-3/fulltext | various |
| trisakil | https://doi.org/10.1080/22221751.2022.2099304 | various |
| Smith | https://pubmed.ncbi.nlm.nih.gov/33055186/ | liquid MGIT 960 system (Bactec MGIT SIRE and PZA package inserts; Becton, Dickinson) and solid 7H10 agar proportion method |
| peker | https://doi.org/10.1099/mgen.0.000695 | incredibly, not noted; the only reference in the methods is 'All MTB isolates were phenotypically tested for drug susceptibility (phenotypic DST)' |
| merker | https://doi.org/10.1038/s41467-022-32455-1 | various |
| finci | https://pubmed.ncbi.nlm.nih.gov/35907429/ | BACTEC MGIT 960 DST and Sensititre MYCOTB MIC plate (binary results reported) |
| leah_bdq | https://doi.org/10.1101/2022.12.08.519610 | BACTEC MGIT 960 DST |
| marco_pheno | https://doi.org/10.3389/fmicb.2023.1104456 | not stated, just that they were 'according to WHO classification' |
| lempens_acc | https://doi.org/10.1016/j.ijid.2020.08.042 | LJ slopes and 7H11 plates, proportional |
Then, the following was noted: “Get the BioProject of all BioSamples with antibiogram data in NCBI. Once I have the BioProject, I can …..download the antibiogram table”.
This added a further 1073 samples to the superset, each with a phenotype for at least one drug.
| bioproj | pheno_method_2 |
|---|---|
| PRJNA353873 | MGIT, MICs listed |
| PRJNA413593 | MGIT, MICs listed |
| PRJNA438921 | MGIT, MICs listed |
| PRJNA557083 | MGIT, MICs listed |
| PRJNA650381 | MGIT and proportional agar, MICs listed for both |
| PRJNA663350 | MGIT, MICs listed |
| PRJNA717333 | 96 well plate, MICs listed |
| PRJNA824124 | MGIT, MICs listed |
| PRJNA834625 | LJ slopes, MICs listed |
| PRJNA888434 | MGIT, MICs listed |